With the growing demand for solar power, a solar power forecasting model is of utmost
importance for integrating more non-conventional energy into the existing electricity grid.
With growing data availability, this is a good time to use data-driven algorithms for
improved prediction of solar energy generation. Gathering and analyzing data makes it
possible to predict solar energy generation and mitigate the impact of solar intermittency.
In this research, we explore the automatic creation of site-specific prediction models that
use machine learning to predict solar radiation from meteorological station weather forecast
reports; from the predicted solar radiation, the corresponding solar power output can be
calculated from the characteristics of the solar PV system used. The challenge is to enhance
the accuracy of the forecast. Ensemble techniques such as random forest (RF) and extreme
gradient boosting (XGBoost) are well suited for solar radiation prediction because they
combine several machine learning models to improve stability and reduce variance and bias,
which allows them to outperform the majority of individual models and makes them a strong
choice in the field of solar energy prediction.

NOMENCLATURE

ANN : Artificial neural network
ARIMA : Autoregressive integrated moving average
CRO-ELM : Coral reefs optimization-extreme learning machine
ELM : Extreme learning machine
GANN : Genetic algorithm neural network
GGA-ELM : Grouping genetic algorithm-extreme learning machine
GRNN : Generalised regression neural network
KELM : Kernel extreme learning machine
MABE : Mean absolute bias error
MAE : Mean absolute error
MAPE : Mean absolute percentage error
MBE : Mean bias error
MSE : Mean square error
R² : Coefficient of determination (R² score)
RF : Random forest
RMSE : Root mean square error
RRMSE : Robust root mean square error
SOM-OPELM : Self-organizing feature map-optimally pruned extreme learning machine
SVM : Support vector machine


1. INTRODUCTION 

The primary objectives of utilizing non-conventional energy systems are to mitigate global climate
change, provide wider access to energy, and improve energy security [1]. Sustainable development can
be made possible by using sustainable energy and guaranteeing that all residents have access to
inexpensive, dependable, sustainable, and contemporary energy [2]-[4]. Given these pressures and needs,
solar power can be an optimal non-conventional energy source [5]. The challenges and limitations
of solar power have been reduced significantly by advancements in photovoltaic (PV) technologies, which
increase the efficiency of energy conversion and notably reduce panel installation and electricity costs.
Solar power is the future, considering that it is an inexhaustible energy source and requires low
installation costs [6]. However, PV panels often cannot provide stable electric power output because of
variations in weather conditions, so the stability of the grid is reduced significantly when photovoltaic
power is integrated into it [7]. Because a stable power system is a necessity, a dedicated forecasting
technique is required for the stable and safe integration of photovoltaic power into the power grid [8].
In addition to helping to stabilize the grid, a precise forecasting model is essential for managing
storage, developing an energy road map for congestion management, and estimating reserves. The
availability of data has made it possible to use deep learning and machine learning techniques [9].

Predictive analysis will be crucial for managing energy use optimally in real time, operating power
systems safely, and balancing consumption and production [10]. The goal is to introduce and use the
most recent technology to create electrical networks that are more secure, effective, eco-friendly,
and reliable [11]. Advances in the accurate forecasting of meteorological and hydrological variables
such as precipitation, evaporation, temperature, and humidity have made it possible to predict solar
energy generation more efficiently [11]. PV prediction models can be used by consumers to coordinate
their use with on-site power generation and, as a result, maximize their profitability. One of the key
advantages of data-driven models is that they take less time to produce predictions and therefore
shorten decision-making in power system planning [12]. Accurate prediction of the energy produced by
PV systems has been identified as one of the major challenges, since it allows grid operators to manage
electricity generation by making informed decisions, which in turn reduces uncertainty and cost. Several
forecasting methods presented in the literature are described in Table 1.


Table 1. Forecasting methods previously proposed by researchers

Ref.  Best model  Base models  Input variables                    Study area  Performance indices   Time scale
[13]  ELM         ANN          Altitudes, latitudes, longitudes,  Turkey      R², MBE, RMSE         Daily
                               land-surface temperatures
[14]  ELM         SVM          Ozone, cloudiness                  Spain       R², MSE               Daily
[15]  CRO-ELM     SVM          Ozone, water vapor                 Spain       R², MAE, RMSE, Bias   Daily
[16]  ELM         ARIMA, SVM   Air temperature, relative humidity Italy       Bias                  Daily
[17]  KELM        SVM          Maximum and minimum air            Iran        MABE, RMSE, RRMSE     Daily
                               temperature
[18]  ELM         SVM, ANN     Sunshine duration, air temperature Iran        R², MABE, MAPE, RMSE, Daily
                                                                              RRMSE
[19]  GGA-ELM     ELM          Long-wave radiation, long-wave     Spain       RRMSE                 Hourly
                               flux, short-wave flux, wind
                               velocity, cloud fraction, water
                               vapor, air temperature
[20]  SOM-OPELM   ARIMA        Sunshine duration                  USA         MAPE, MAE, MBE, RMSE  Hourly
[21]  ELM, GANN   GRNN, RF     Sunshine duration                  China       MAE, RMSE             Daily

One of the most widely used machine learning techniques, the support vector machine (SVM), has been
used extensively in building energy prediction and in non-conventional energy applications. Even with
a small dataset, the approach is highly effective for solving non-linear problems. To reduce the upper
bound of the generalization error, which is the sum of the training error and a confidence term, the
support vector machine uses the structural risk minimization (SRM) principle [22]. Applying the
fundamental principles of the support vector machine to regression problems entails introducing a kernel
function that non-linearly maps the input space into a higher-dimensional feature space and then
performing a linear regression in that feature space [23], [24]. The technique has shown promise on a
range of datasets over the years and has been reported to outperform the ANN. Shi et al. [25] also
proposed a PV forecasting approach based on support vector machines and weather classification. The
results demonstrated that the proposed prediction model was effective and promising for grid-connected
systems. Yousif and Kazem [26] used support vector machines to predict solar photovoltaic power output;
the proposed model predicted photovoltaic current using inputs such as solar radiation and ambient air
temperature. Kazem and Yousif [27] employed a support vector machine model and compared its performance
with multi-layer perceptron and generalized feed-forward (GFF) networks.

In artificial intelligence (AI) and machine learning, deep learning models are regarded as a new
paradigm of learning. Recent years have seen a substantial increase in interest in deep learning due to
its ability to handle complex data. Various ANN architectures, from simple networks to more complex
models such as auto-encoders [28] and long short-term memory (LSTM) networks [29], have been used
successfully to forecast renewable energy. A method based on artificial neural networks was developed
by Hiyama and Kitabayashi [30] to predict the maximum power output from a PV system. The authors' input
features included solar radiation, wind speed, and outside air temperature [31]. A hybrid multi-layer
feed-forward neural network was created by Sulaiman et al. [32] to estimate the output from a
grid-connected PV system.

The published works show that many approaches have been used to predict solar power or solar
radiation. Among them, the artificial neural network is one of the favored techniques. However, an
artificial neural network requires the user to provide several model settings, for instance, the number
of hidden layers, the number of neurons in each hidden layer, and the number of training epochs. Two of
the most prominent machine learning techniques, support vector machines and artificial neural networks,
have shown instability issues [33]: even slight changes in the input data can cause significant
variation in the predicted values. To overcome these instability issues, a more advanced machine
learning approach, ensemble learning, was developed.

Ensemble learning is a machine learning technique that involves training many base learners and
combining their outputs to address a single problem. The fundamental tenet is that the aggregate output
of the weak (base) learners should generally be more accurate than the output of any one learner. Very
little research has been done on ensemble techniques for predicting solar radiation; ensemble-based
methods such as RF, gradient boosting, and extreme gradient boosting have not been extensively studied
in this context. Because they are able to overcome the shortcomings of individual models, ensemble-based
strategies typically outperform the individual learners from which they are built [25]. Ensemble
techniques have drawn a lot of interest and are now in demand across various industries. The motivation
for applying ensemble-based methods to solar photovoltaic systems is twofold: the majority of earlier
research works are centered on regression methods, support vector machines, and artificial neural
networks and their variants, and ensemble-based algorithms are more computationally efficient than
these widely used algorithms. This paper studies the performance of two major ensemble methods, RF and
XGBoost, with hyperparameter tuning, in predicting solar radiation from forecast weather data produced
by a meteorological station.


2. THEORY OF FORECASTING MODEL 
2.1. Random forest (RF) 

RF is an ensemble-based bagging machine learning algorithm comprising a large number of decision
trees, which serve as the base (weak) learners. The working of an RF is depicted in Figure 1. In RF, the
performance of the individual decision trees is improved by aggregating their results. The hallmark of
RF is random row sampling and random feature sampling: each decision tree is trained on a randomly
selected set of rows and features from the dataset. A separate cross-validation step is often
unnecessary with RF because out-of-bag error estimation is carried out as part of the forest-building
process. RF first creates numerous additional training datasets by randomly sampling the initial
training dataset with replacement. Each new training dataset is the same size as the original; however,
sampling with replacement means that some observations may be duplicated [34]. Each decision tree has
high variance because it is trained on a particular sample, but because the final output is the
aggregate of all the individual trees' outputs, the final output has low variance.
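
The bootstrap step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function names and the sample size of 1000 rows are assumptions made for the example. It draws one bootstrap sample, collects the rows that were never drawn (the out-of-bag rows), and averages single-tree predictions as a regression forest would.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def out_of_bag(n_rows, in_bag):
    """Return the rows that were never drawn into the bootstrap sample."""
    drawn = set(in_bag)
    return [i for i in range(n_rows) if i not in drawn]


def average_prediction(tree_predictions):
    """Aggregate single-tree predictions by averaging (regression)."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)      # fixed seed so the sketch is repeatable
n_rows = 1000               # hypothetical number of training rows
in_bag = bootstrap_indices(n_rows, rng)
oob = out_of_bag(n_rows, in_bag)

# With replacement, roughly 1/e (about 37%) of the rows end up out of bag.
oob_fraction = len(oob) / n_rows
print(f"in-bag unique rows: {len(set(in_bag))}, out-of-bag rows: {len(oob)}")
```

Because every tree leaves some rows unseen, those rows can serve as a built-in validation set for that tree, which is the basis of the out-of-bag error estimate.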


2.2. XGBoost 

XGBoost is an optimized, distributed ensemble machine learning algorithm built on the gradient
boosting framework. The working of XGBoost is depicted in Figure 2. The algorithm originated in a
research project at the University of Washington and led to a major advancement in the machine learning
domain. XGBoost also uses decision trees as its weak learners, but the trees are built sequentially, so
each weak learner depends on the ones before it [35]. Errors play a significant role in this process:
each new tree is fitted to the errors (the gradient of the loss) left by the trees built so far, so
observations that were predicted poorly by earlier learners carry more weight when the next learner is
trained. These distinct weak learners are then combined to produce a model that is more precise and
accurate. Its high execution speed and out-of-core computation make XGBoost a favorite among data
scientists [36].
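
The sequential idea can be illustrated with a minimal sketch of plain gradient boosting for squared error, in which each new learner is a one-split tree (a stump) fitted to the residuals of the current ensemble. This is not XGBoost itself, which adds regularization, second-order gradient information, and systems optimizations; the function names, the toy data, and the values of `n_rounds` and `learning_rate` are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)            # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, preds


# Toy non-linear target, purely illustrative.
xs = [i / 10 for i in range(40)]
ys = [(x - 2.0) ** 2 for x in xs]

base, stumps, preds = boost(xs, ys)
mse_before = mse(ys, [base] * len(ys))
mse_after = mse(ys, preds)
print(f"MSE with mean only: {mse_before:.4f}, after boosting: {mse_after:.4f}")
```

Each round can only lower the training error here, because the stump is the least-squares fit to the residuals and the learning rate shrinks its contribution.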


[Figure 1 diagram: training data (n observations, m predictors) are drawn into k bootstrap samples;
each sample is divided into an in-bag portion (2/3) and an out-of-bag (OOB) portion (1/3); the
single-tree predictions on the test data are averaged.]

Figure 1. Working of RF


3. METHODOLOGY
3.1. Data Description

The dataset consists of meteorological data from the Hawaii Space Exploration Analog and Simulation
(HI-SEAS) weather station, recorded between mission IV and mission V, on which the models are trained
and tested. Weather parameters in the dataset include temperature, humidity, time of day, wind speed,
sunrise and sunset times, wind direction, and barometric pressure. Consecutive instances of weather
parameter data are 15 min apart. The models predict solar radiation as output. Combining the predicted
radiation with the characteristic parameters of the PV panel used gives the solar power output. The
everyday hourly values of radiation are shown in Figure 3, and Figure 4 shows the combination of
pressure and temperature for different values of radiation.
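
The PV model used for this conversion is not specified here. As a hedged illustration only, a common first-order relation multiplies the predicted irradiance by the panel area and the conversion efficiency. The symbols η (module efficiency), A (panel area), and G (predicted solar radiation), as well as the numeric values, are assumptions introduced for this example:

```latex
P = \eta \, A \, G
\qquad \text{e.g.} \qquad
P = 0.18 \times 1.6\,\mathrm{m^2} \times 800\,\mathrm{W/m^2} = 230.4\,\mathrm{W}
```

More detailed PV models also account for cell temperature and system losses, which this sketch omits.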

Figure 5 shows the correlation plot between all the weather parameters. Each entry of the correlation
matrix is the correlation coefficient between two variables, which describes the strength of the linear
relationship between them. The diagonal elements of the correlation matrix have the value 1 because
each is the correlation of a weather parameter with itself. Values near 0 indicate that the features
are only weakly linearly related, while values near 1 and -1 indicate that the features are strongly
related, positively and negatively, respectively.
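
A correlation matrix of this kind can be computed directly from the definition of the Pearson coefficient. The sketch below uses only the standard library; the column names mirror the weather parameters listed earlier, but the five readings per column are made-up values for illustration, not data from the HI-SEAS dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical readings, for illustration only.
weather = {
    "temperature": [48.0, 50.0, 55.0, 60.0, 58.0],
    "humidity": [90.0, 85.0, 70.0, 60.0, 65.0],
    "pressure": [30.40, 30.42, 30.45, 30.43, 30.44],
}

matrix = correlation_matrix(weather)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

The diagonal entries come out as 1 and the matrix is symmetric, matching the properties described above.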

Figure 3. Everyday hourly values of radiation

Figure 4. Pressure and temperature for different values of radiation


3.2. Evaluation indices
3.2.1. Root mean square error (RMSE) 

The root mean square error (RMSE) is the standard deviation of the residuals (prediction errors). A
residual measures how far a data point lies from the regression line, and RMSE measures how spread out
these residuals are. In other words, it indicates how concentrated the data are around the line of best
fit. Root mean square error is commonly used to validate experimental results in climatology,
forecasting, and regression analysis. The formula is:


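$$\mathrm{RMSE} = \sqrt{\overline{(f - o)^2}} \tag{1}$$


where f is the predicted value (model output) and o is the actual value. The mean is indicated by the
bar above the squared differences. A slightly different notation can be used to write the same formula
as follows:


$$\mathrm{RMSE} = \left[ \frac{1}{N} \sum_{i=1}^{N} \left( z_{f_i} - z_{o_i} \right)^2 \right]^{1/2} \tag{2}$$


where $(z_{f_i} - z_{o_i})^2$ is the squared difference between the i-th predicted and actual values,
and N is the sample size.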
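
RMSE, taken as the square root of the mean squared difference between predicted and actual values, translates directly into code. The following is a minimal sketch with made-up numbers; the function name `rmse` and the sample values are assumptions for illustration.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(f - o) ** 2 for f, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


predicted = [2.0, 4.0, 6.0]   # hypothetical model outputs
observed = [1.0, 4.0, 8.0]    # hypothetical measurements

# Squared differences are 1, 0 and 4, so the result is sqrt(5 / 3).
value = rmse(predicted, observed)
print(round(value, 4))
```

RMSE carries the same units as the predicted quantity, so for solar radiation it is read on the scale of the radiation measurements themselves.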

Figure 5. Correlation plot between all the weather parameters


3.2.2. R² score

The coefficient of determination, also known as the R² score, is used to evaluate the accuracy of a
regression model. It measures the proportion of the variation in the observed values that the model's
predictions explain, and it is expressed as,


$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \tag{3}$$


where $SS_{tot}$ denotes the total sum of squares and $SS_{res}$ denotes the sum of squares of the
residual errors.
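
The R² score, defined as one minus the ratio of the residual sum of squares to the total sum of squares, can be sketched as follows. The function name and the sample values are assumptions for illustration.

```python
def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]   # hypothetical measurements

# A perfect model leaves no residual error.
perfect = r2_score(observed, observed)

# A model that always predicts the mean explains none of the variation.
mean_only = r2_score(observed, [2.5] * len(observed))

print(perfect, mean_only)
```

A score of 1 therefore means the predictions match the observations exactly, while a score of 0 means the model does no better than predicting the mean of the observed values.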


3.3. Hyperparameter tuning 

Hyperparameter tuning consists of finding a set of optimal hyperparameter values for a learning
algorithm and then applying the tuned algorithm to a dataset. The chosen combination of hyperparameters
maximizes the model's performance by minimizing a predefined loss function, which yields better
accuracy. Hyperparameters are specific to the algorithm itself and are set before training, so their
values cannot be calculated directly from the data; instead, they govern how the model parameters are
learned from the data.


Global solar radiation forecast using an ensemble learning approach (Debani Prasad Mishra) 


ISSN: 2088-8694


3.3.1. Random search cross-validation 

Random search is an effective method for discovering a good set of hyperparameters for a machine
learning model. Using random draws from specified hyperparameter distributions, the randomized search
meta-estimator trains and evaluates a number of models with cross-validation. After training N distinct
models with various randomly chosen hyperparameter combinations, the algorithm selects the
best-performing model it has seen, giving a model trained on a nearly ideal set of hyperparameters.
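
The procedure can be sketched without any machine learning library. In the example below, the search space, the parameter names, and `toy_objective` are all hypothetical: the objective is a stand-in for a cross-validated error that a real run would obtain by training and scoring a model for each draw.

```python
import random


def random_search(objective, space, n_iter, seed=0):
    """Sample n_iter random combinations and keep the lowest-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space for a random forest.
space = {
    "n_estimators": [100, 200, 400, 800],
    "max_depth": [5, 10, 20, None],
    "min_samples_leaf": [1, 2, 4],
}


def toy_objective(params):
    """Stand-in for a cross-validated error; lower is better."""
    depth = params["max_depth"] or 30   # treat None (unlimited) as deep
    return (abs(params["n_estimators"] - 400) / 100
            + abs(depth - 10) / 10
            + params["min_samples_leaf"])


best_params, best_score = random_search(toy_objective, space, n_iter=20)
print(best_params, best_score)
```

Only 20 of the 48 possible combinations are evaluated here, which shows the trade-off of random search: it does not guarantee the optimum, but it bounds the tuning cost in advance.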


3.3.2. Optuna 

Optuna is a software framework for automating the optimization of hyperparameters. By utilizing
several samplers, including grid search, random, Bayesian, and evolutionary algorithms, it automatically
searches for the ideal hyperparameter values. The choice of machine learning algorithm can itself be
treated as a hyperparameter in Optuna, in which case the search returns the best-performing algorithm
along with its hyperparameters.


4. MODEL TRAINING AND PREDICTION RESULTS
4.1. Random forest (RF) regression model 
4.1.1. Training and Hyperparameter tuning 

The data are initially in x*y format, where x stands for the number of features and outputs and y for
the overall number of instances. In this phase, the dataset is separated into training and testing sets.
The effectiveness of the RF algorithm depends on the adjustment of its hyperparameters, i.e., the number
of trees, the number of features to consider at every split, the minimum number of samples required to
split a node, the maximum number of levels in a tree, the minimum number of samples required at each
leaf node, and whether bootstrapping is used (bootstrapping is a statistical resampling approach in
which the dataset is randomly sampled with replacement). Random search CV hyperparameter tuning returns
the best combination of these hyperparameters for a more accurate model.


4.1.2. Prediction and result 

The predictive performance of the RF regressor model is illustrated in Figure 6, which plots the
radiation values predicted by the RF model at different Unix times against the measured (actual) values
in the testing dataset. The results show how closely the predictions track the measurements and how
accurately the model can forecast solar radiation. At some Unix times, larger discrepancies between the
actual and predicted values are seen because of higher variation in solar radiation. In spite of that,
the RF model showed strong non-linear mapping and generalization ability and can be efficient in
predicting solar radiation. The model achieved an R² score of 0.809 and an RMSE of 108.55.


[Plot: performance of the model in prediction of radiation for the corresponding UNIXTime]

Figure 6. Radiation values predicted by the RF model at different Unix times vs measured values


4.2. XGBoost model 
4.2.1. Training and Hyperparameter tuning 

The data are initially in x*y format, where x stands for the number of features and outputs and y
for the overall number of instances. In this phase, the dataset is split into a training set, a
validation set, and a testing set. While the model hyperparameters are being fine-tuned, the validation
set provides an unbiased evaluation of a model fitted on the training set.
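
A three-way split of this kind can be sketched with the standard library. The split ratios are not reported in the text, so the 70/15/15 proportions, the shuffling, and the function name below are assumptions for illustration only.

```python
import random


def train_val_test_split(rows, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the rows and cut them into training, validation and test sets."""
    rows = list(rows)                    # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    n_val = int(len(rows) * val_fraction)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


rows = list(range(100))                  # stand-in for 100 dataset instances
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))
```

For time-ordered weather data, a chronological split without shuffling is a common alternative, because it prevents the model from being validated on measurements that precede its training data.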

The adjustment of hyperparameters increases the accuracy of the XGBoost model. The tuned settings are
the learning rate (how quickly the model learns, i.e., how much each new tree contributes), early
stopping (the validation metric must improve at least once within the given number of rounds for
training to continue), evals (a list of validation sets for which metrics are evaluated during
training), the depth of the trees, and 'num_boost_round' (the number of trees to build). Optuna
hyperparameter tuning is used to optimize these settings. During optimization, a new set of
hyperparameters is created at each iteration and its loss value is evaluated. The set with the lowest
loss value is chosen as the best set, and the final model is built using it.
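
The early-stopping rule can be made concrete with a small sketch that scans a sequence of per-round validation losses and stops once no improvement has been seen for a given number of rounds. The function name, the `patience` parameter, and the loss values are hypothetical; a boosting library would apply an equivalent rule internally during training.

```python
def best_round_with_early_stopping(val_losses, patience):
    """Return (best_round, best_loss), stopping after `patience` rounds
    without an improvement in the validation loss."""
    best_round, best_loss = 0, float("inf")
    for round_index, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = round_index, loss
        elif round_index - best_round >= patience:
            break                        # no improvement for `patience` rounds
    return best_round, best_loss


# Hypothetical validation RMSE after each boosting round.
val_losses = [130.0, 125.0, 123.0, 122.5, 122.8, 123.1, 123.4, 121.0]

best_round, best_loss = best_round_with_early_stopping(val_losses, patience=3)
print(best_round, best_loss)
```

With a patience of 3, training stops at round 6 and the later value of 121.0 is never reached, which illustrates the trade-off: a small patience saves training time but can miss late improvements.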


4.2.2. Prediction and result 

The predictive performance of the XGBoost model is illustrated in Figure 7, which plots the radiation
values predicted by the XGBoost model at different Unix times against the measured (actual) values in
the testing dataset. The results show how closely the predictions track the measurements and how
accurately the model can forecast solar radiation. The statistical evaluation indices R² and RMSE were
used to appraise the model's performance, as presented in Table 2. The XGBoost model is observed to
show higher fluctuation in predicting solar radiation than the RF model. The model achieved an R² value
of 0.645 and an RMSE of 122.13.

Figure 7. Radiation values predicted by the XGBoost model at different Unix times vs measured values


Table 2. Statistical evaluation indices

Models               Performance indices
                     R²          RMSE
RF regressor         0.809494    108.55
XGBoost regressor    0.645419    122.1278


5. CONCLUSION 

In this research work, the practicability of deploying tree-based ensemble methods (RF and XGBoost)
to predict solar radiation, from which the photovoltaic system power output can be evaluated, was
investigated. The capability of ensemble methods for predicting solar radiation has been verified
through the models' prediction performance: the RF regressor achieved an R² of 0.809 and an RMSE of
108.55, while the XGBoost regressor achieved an R² of 0.645 and an RMSE of 122.13. Ensemble algorithms
have been shown to marginally outperform other popular machine learning techniques. The work also aimed
to use tree-based ensemble methods to explain the significance of the input attributes. Based on several
weather parameters, the developed machine learning models can be used to forecast solar radiation. Both
RF (through internal out-of-bag validation) and XGBoost (through a validation set) are validated during
training and can be used to manage datasets with large dimensions. The modeling strategy is shown to be
a reliable one that can be used for real-time solar radiation prediction. There is still room for
improvement, which can lead to more accurate models. One direction for future work is to extend the
existing work to more generalized datasets.